 pitch contour


A new kid on the block: Distributional semantics predicts the word-specific tone signatures of monosyllabic words in conversational Taiwan Mandarin

Jin, Xiaoyun, Ernestus, Mirjam, Baayen, R. Harald

arXiv.org Artificial Intelligence

We present a corpus-based investigation of how the pitch contours of monosyllabic words are realized in spontaneous conversational Mandarin, focusing on the effects of words' meanings. We used the generalized additive model to decompose a given observed pitch contour into a set of component pitch contours that are tied to different control variables and semantic predictors. Even when variables such as word duration, gender, speaker identity, tonal context, vowel height, and utterance position are controlled for, word remains a strong predictor of tonal realization. We present evidence that this effect of word is a semantic effect: word sense is shown to be a better predictor than word, and heterographic homophones are shown to have different pitch contours. The strongest evidence for the importance of semantics is that the pitch contours of individual word tokens can be predicted from their contextualized embeddings with an accuracy that substantially exceeds a permutation baseline. For phonetics, distributional semantics is a new kid on the block. Although our findings challenge standard theories of Mandarin tone, they fit well within the theoretical framework of the Discriminative Lexicon Model.
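The additive structure that such a GAM estimates can be illustrated with a minimal sketch: an observed f0 contour is modeled as the sum of component contours, one per predictor. The component shapes below are hand-picked illustrations, not estimates from the paper's data.

```python
import math

# Hypothetical component contours over normalized time t in [0, 1].
# A fitted GAM would estimate each of these as a smooth function.
def baseline(t):       # overall mean contour (Hz)
    return 200.0 - 20.0 * t

def tonal_context(t):  # e.g. lowering carried over from a preceding low tone
    return -15.0 * (1.0 - t)

def word_effect(t):    # word-specific tonal signature
    return 10.0 * math.sin(math.pi * t)

def predicted_f0(t):
    # GAM prediction = sum of the component contours
    return baseline(t) + tonal_context(t) + word_effect(t)

grid = [i / 10 for i in range(11)]
print([round(predicted_f0(t), 1) for t in grid])
```

Decomposing a contour this way is what lets the analysis isolate the contribution of word (or word sense) after the control variables have absorbed their share.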


Real-Time Pitch/F0 Detection Using Spectrogram Images and Convolutional Neural Networks

Zhao, Xufang, Tsimhoni, Omer

arXiv.org Artificial Intelligence

Pitch (also called F0 or fundamental frequency) is a very important voice feature for smart mobility applications, such as driver emotion detection, personalized vehicle profiles, and secure speaker identification. This paper presents a novel approach to detecting F0 through Convolutional Neural Networks (CNN) and image processing techniques that directly estimate pitch from spectrogram images. Our new approach demonstrates very good detection accuracy: a total of 92% of predicted pitch contours have strong or moderate correlations with the true pitch contours. Furthermore, an experimental comparison between our new approach and other state-of-the-art CNN methods reveals that our approach can improve the detection rate by approximately 5% across various Signal-to-Noise Ratio (SNR) conditions. Pitch detection is very widely used for smart mobility features. For example, as shown in Fig. 1, a pitch contour can be used to train a deep neural network for driver emotion detection, which can warn of road rage.
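The evaluation criterion here is the correlation between a predicted and a true pitch contour. A minimal sketch of that criterion, with rule-of-thumb "strong"/"moderate" bands (the paper's exact thresholds are not stated in the abstract, so these cutoffs are illustrative):

```python
import math

def pearson_r(x, y):
    # Pearson correlation between a predicted and a true pitch contour
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_band(r):
    # Illustrative rule-of-thumb bands, not the paper's definition
    if abs(r) >= 0.7:
        return "strong"
    if abs(r) >= 0.4:
        return "moderate"
    return "weak"

true_f0 = [110, 115, 122, 130, 128, 121]   # toy contour (Hz)
pred_f0 = [112, 114, 120, 131, 127, 123]
r = pearson_r(true_f0, pred_f0)
print(round(r, 3), correlation_band(r))
```

A predicted contour counts toward the 92% figure when its correlation with the ground-truth contour falls in the strong or moderate band.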


The realization of tones in spontaneous spoken Taiwan Mandarin: a corpus-based survey and theory-driven computational modeling

Lu, Yuxin, Chuang, Yu-Ying, Baayen, R. Harald

arXiv.org Artificial Intelligence

A growing body of literature has demonstrated that semantics can co-determine fine phonetic detail. However, the complex interplay between phonetic realization and semantics remains understudied, particularly in pitch realization. The current study investigates the tonal realization of Mandarin disyllabic words with all 20 possible combinations of two tones, as found in a corpus of Taiwan Mandarin spontaneous speech. We made use of Generalized Additive Mixed Models (GAMs) to model f0 contours as a function of a series of predictors, including gender, tonal context, tone pattern, speech rate, word position, bigram probability, speaker and word. In the GAM analysis, word and sense emerged as crucial predictors of f0 contours, with effect sizes that exceed those of tone pattern. For each word token in our dataset, we then obtained a contextualized embedding by applying the GPT-2 large language model to the context of that token in the corpus. We show that the pitch contours of word tokens can be predicted to a considerable extent from these contextualized embeddings, which approximate token-specific meanings in contexts of use. The results of our corpus study show that meaning in context and phonetic realization are far more entangled than standard linguistic theory predicts.
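The step from contextualized embeddings to pitch contours can be illustrated, in a heavily simplified form, by a nearest-neighbour predictor: a held-out token receives the contour of the training token whose embedding is most similar under cosine similarity. The vectors below are toy 3-d stand-ins, not GPT-2 embeddings, and the predictor is a sketch of the idea, not the authors' model.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def predict_contour(embedding, train):
    # train: list of (embedding, f0_contour) pairs;
    # return the contour of the most similar training token
    best = max(train, key=lambda pair: cosine(embedding, pair[0]))
    return best[1]

# Toy "embeddings" paired with schematic f0 contours (Hz)
train = [
    ([0.9, 0.1, 0.0], [220, 210, 200]),   # falling contour
    ([0.1, 0.9, 0.1], [180, 190, 205]),   # rising contour
]
held_out = [0.8, 0.2, 0.1]
print(predict_contour(held_out, train))
```

If token-specific meanings co-determine f0, then tokens with similar embeddings should indeed have similar contours, which is the regularity such a predictor exploits.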


Exploratory Study Of Human-AI Interaction For Hindustani Music

Shikarpur, Nithya, Huang, Cheng-Zhi Anna

arXiv.org Artificial Intelligence

This paper presents a study of participants interacting with and using GaMaDHaNi, a novel hierarchical generative model for Hindustani vocal contours. To explore possible use cases in human-AI interaction, we conducted a user study with three participants, each engaging with the model through three predefined interaction modes. Although this study was conducted "in the wild"-- with the model unadapted for the shift from the training data to real-world interaction -- we use it as a pilot to better understand the expectations, reactions, and preferences of practicing musicians when engaging with such a model. We note their challenges as (1) the lack of restrictions in model output, and (2) the incoherence of model output. We situate these challenges in the context of Hindustani music and aim to suggest future directions for the model design to address these gaps.


A corpus-based investigation of pitch contours of monosyllabic words in conversational Taiwan Mandarin

Jin, Xiaoyun, Ernestus, Mirjam, Baayen, R. Harald

arXiv.org Artificial Intelligence

In addition, Chuang et al. (2024) recently reported that the tonal contours of disyllabic Mandarin words with the T2-T4 tone pattern are co-determined by their meanings. Following up on Chuang et al.'s (2024) research, we present a corpus-based investigation of how the pitch contours of monosyllabic words are realized in spontaneous conversational Mandarin, focusing on the effects of contextual predictors on the one hand, and the way in which words' meanings co-determine pitch contours on the other. We analyze the F0 contours of 3824 tokens of 63 different word types in a corpus of spontaneous conversational Taiwan Mandarin, using the generalized additive (mixed) model to decompose a given observed pitch contour into a set of component pitch contours. These component pitch contours isolate the contributions to the pitch contour of the variables taken into account in the statistical model. We show that the tones immediately to the left and right of a word substantially modify a word's canonical tone. Once the effect of tonal context is controlled for, the canonical rising (T2) and dipping (T3) tones emerge as low flat tones, contrasting with T1 as a high tone, and with T4 as a high-to-mid falling tone. The neutral tone (T0), which in standard descriptions is taken to depend primarily on the preceding tone for its realization, emerges as a low tone in its own right, the realization of which is modified by the other predictors in the same way as the standard tones T1, T2, T3, and T4. In line with the results from a previous study on disyllabic words with the T2-T4 tonal contour (Chuang et al., 2024), we also show that word, and even more so word sense, co-determine words' F0 contours, and that, as a consequence, heterographic homophones (e.g., 的, 得, and 地) have their own tonal signatures.
Analyses of variable importance using random forests further supported the substantial effect of tonal context and an effect of word sense that is almost as important as that of tonal context.
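Permutation-style variable importance, the idea behind the random-forest importance scores reported here, can be sketched generically: shuffle one predictor's column, re-score the model, and take the drop in fit as that predictor's importance. This is a schematic illustration with a toy rule-based model, not the authors' code or data.

```python
import random

def score(model, rows, targets):
    # proportion of rows the model classifies correctly
    return sum(model(r) == t for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, col, seed=0):
    baseline_score = score(model, rows, targets)
    rng = random.Random(seed)
    shuffled = [r[col] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled)]
    # importance = how much accuracy is lost when this predictor is scrambled
    return baseline_score - score(model, permuted, targets)

# Toy data: (tonal_context, word_sense) -> contour class.
# This toy model only uses tonal_context (column 0).
rows = [("low", "s1"), ("low", "s2"), ("high", "s1"), ("high", "s2")]
targets = ["flat", "flat", "fall", "fall"]
model = lambda r: "flat" if r[0] == "low" else "fall"
print(permutation_importance(model, rows, targets, col=0),
      permutation_importance(model, rows, targets, col=1))
```

Since the toy model ignores column 1, shuffling that column leaves the score unchanged and its importance is exactly zero; a predictor the model relies on loses accuracy when shuffled.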


Form and meaning co-determine the realization of tone in Taiwan Mandarin spontaneous speech: the case of Tone 3 sandhi

Lu, Yuxin, Chuang, Yu-Ying, Baayen, R. Harald

arXiv.org Artificial Intelligence

In Standard Chinese, Tone 3 (the dipping tone) becomes Tone 2 (rising tone) when followed by another Tone 3. Previous studies have noted that this sandhi process may be incomplete, in the sense that the assimilated Tone 3 is still distinct from a true Tone 2. While Mandarin Tone 3 sandhi is widely studied using carefully controlled laboratory speech (Xu, 1997) and more formal registers of Beijing Mandarin (Yuan and Chen, 2014), less is known about its realization in spontaneous speech, and about the effect of contextual factors on tonal realization. The present study investigates the pitch contours of two-character words with T2-T3 and T3-T3 tone patterns in spontaneous Taiwan Mandarin conversations. Our analysis makes use of the Generalized Additive Mixed Model (GAMM; Wood, 2017) to examine fundamental frequency (f0) contours as a function of normalized time. We consider various factors known to influence pitch contours, including gender, speaking rate, speaker, neighboring tones, word position, and bigram probability, as well as the novel predictors word and word sense (Chuang et al., 2024). Our analyses revealed that in spontaneous Taiwan Mandarin, T3-T3 words become indistinguishable from T2-T3 words, indicating complete sandhi, once the strong effect of word (or word sense) is taken into account. For our data, the shape of f0 contours is not co-determined by word frequency. In contrast, the effect of word meaning on f0 contours is robust, as strong as the effect of adjacent tones, and is present for both T2-T3 and T3-T3 words.


Hierarchical Generative Modeling of Melodic Vocal Contours in Hindustani Classical Music

Shikarpur, Nithya, Dendukuri, Krishna Maneesha, Wu, Yusong, Caillon, Antoine, Huang, Cheng-Zhi Anna

arXiv.org Artificial Intelligence

Hindustani music is a performance-driven oral tradition that exhibits the rendition of rich melodic patterns. In this paper, we focus on generative modeling of singers' vocal melodies extracted from audio recordings, as the voice is musically prominent within the tradition. Prior generative work in Hindustani music models melodies as coarse discrete symbols, which fails to capture the rich expressive melodic intricacies of singing. Thus, we propose to use a finely quantized pitch contour as an intermediate representation for hierarchical audio modeling. We propose GaMaDHaNi, a modular two-level hierarchy consisting of a generative model on pitch contours and a pitch-contour-to-audio synthesis model. We compare our approach to non-hierarchical audio models and hierarchical models that use a self-supervised intermediate representation, through a listening test and qualitative analysis. We also evaluate the audio model's ability to faithfully represent the pitch contour input using the Pearson correlation coefficient. By using pitch contours as an intermediate representation, we show that our model may be better equipped to listen and respond to musicians in a human-AI collaborative setting by highlighting two potential interaction use cases: (1) primed generation, and (2) coarse pitch conditioning.
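A "finely quantized pitch contour" can be illustrated by converting f0 values in Hz to cents relative to a tonic and rounding each frame to a fixed bin size. The tonic frequency and 10-cent bin width below are illustrative assumptions, not the paper's actual settings.

```python
import math

def hz_to_cents(f0, tonic_hz):
    # 1200 cents per octave, measured relative to the tonic
    return 1200.0 * math.log2(f0 / tonic_hz)

def quantize_contour(contour_hz, tonic_hz=220.0, bin_cents=10):
    # round each frame to the nearest bin_cents-sized pitch bin
    return [round(hz_to_cents(f, tonic_hz) / bin_cents) * bin_cents
            for f in contour_hz]

# A semitone is 100 cents, so these three tones land 100 cents apart
print(quantize_contour([220.0, 233.08, 246.94]))
```

A fine grid like this preserves ornaments and glides that a coarse symbolic (note-level) representation discards, which is the motivation for using it as the intermediate representation.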


Word-specific tonal realizations in Mandarin

Chuang, Yu-Ying, Bell, Melanie J., Tseng, Yu-Hsiang, Baayen, R. Harald

arXiv.org Artificial Intelligence

The pitch contours of Mandarin two-character words are generally understood as being shaped by the underlying tones of the constituent single-character words, in interaction with articulatory constraints imposed by factors such as speech rate, co-articulation with adjacent tones, segmental make-up, and predictability. This study shows that tonal realization is also partially determined by words' meanings. We first show, on the basis of a Taiwan corpus of spontaneous conversations, using the generalized additive regression model, and focusing on the rise-fall tone pattern, that after controlling for effects of speaker and context, word type is a stronger predictor of pitch realization than all the previously established word-form related predictors combined. Importantly, the addition of information about meaning in context improves prediction accuracy even further. We then proceed to show, using computational modeling with context-specific word embeddings, that token-specific pitch contours predict word type with 50% accuracy on held-out data, and that context-sensitive, token-specific embeddings can predict the shape of pitch contours with 30% accuracy. These accuracies, which are an order of magnitude above chance level, suggest that the relation between words' pitch contours and their meanings is sufficiently strong to be functional for language users. The theoretical implications of these empirical findings are discussed.


Audio Generation with Multiple Conditional Diffusion Model

Guo, Zhifang, Mao, Jianguo, Tao, Rui, Yan, Long, Ouchi, Kazushige, Liu, Hong, Wang, Xiangdong

arXiv.org Artificial Intelligence

Text-based audio generation models have limitations as they cannot encompass all the information in audio, leading to restricted controllability when relying solely on text. To address this issue, we propose a novel model that enhances the controllability of existing pre-trained text-to-audio models by incorporating additional conditions including content (timestamp) and style (pitch contour and energy contour) as supplements to the text. This approach achieves fine-grained control over the temporal order, pitch, and energy of generated audio. To preserve the diversity of generation, we employ a trainable control condition encoder that is enhanced by a large language model and a trainable Fusion-Net to encode and fuse the additional conditions while keeping the weights of the pre-trained text-to-audio model frozen. Due to the lack of suitable datasets and evaluation metrics, we consolidate existing datasets into a new dataset comprising the audio and corresponding conditions and use a series of evaluation metrics to evaluate the controllability performance. Experimental results demonstrate that our model successfully achieves fine-grained control to accomplish controllable audio generation. Audio samples and our dataset are publicly available at https://conditionaudiogen.github.io/conditionaudiogen/


Finding Tori: Self-supervised Learning for Analyzing Korean Folk Song

Han, Danbinaerin, Repetto, Rafael Caro, Jeong, Dasaem

arXiv.org Artificial Intelligence

In this paper, we introduce a computational analysis of a field recording dataset of approximately 700 hours of Korean folk songs, recorded around the 1980s-90s. Because most of the songs were sung by non-expert musicians without accompaniment, the dataset poses several challenges. To address these challenges, we applied self-supervised learning with a convolutional neural network based on pitch contours, and then analyzed how the musical concept of tori, a classification system defined by a specific scale, ornamental notes, and an idiomatic melodic contour, is captured by the model. The experimental results show that our approach can better capture the characteristics of tori than traditional pitch histograms. Using our approach, we examine how musical concepts discussed in the existing scholarship manifest in actual field recordings of Korean folk songs.
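The pitch-histogram baseline mentioned above can be sketched by folding every f0 value into a single octave of cents above a reference frequency and counting bin occupancy; the reference frequency and 12-bin resolution here are illustrative choices, not the paper's configuration.

```python
import math

def pitch_class_histogram(contour_hz, ref_hz=110.0, bins=12):
    # fold each f0 value into one octave (0-1200 cents) above ref_hz,
    # so octave-equivalent pitches land in the same bin
    counts = [0] * bins
    width = 1200 / bins  # 100 cents per bin when bins=12
    for f in contour_hz:
        cents = (1200.0 * math.log2(f / ref_hz)) % 1200.0
        counts[int(cents // width) % bins] += 1
    return counts

# 110, 220, and 440 Hz are octave-equivalent; 165 Hz is a fifth above 110 Hz
print(pitch_class_histogram([110.0, 220.0, 440.0, 165.0]))
```

Such a histogram discards temporal order and melodic contour, which is exactly the information a contour-based self-supervised model retains, and that difference is what lets the latter capture tori better.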